
Conversation

@tomerqodo

Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that forcing this responsibility on the user might be less error-prone.

Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests, where compilation is expected to succeed.)
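A minimal sketch of what such a check could look like (the file names, include path, and compiler invocation below are illustrative assumptions, not the actual test harness):

```python
# Hedged sketch: compile each per-registration file against the 2.9 headers and
# assert that compilation fails, except for the designated negative-test kernels.
# Paths and names below are placeholders, not the real test layout.
import pathlib
import subprocess

EXPECT_SUCCESS = {"negative_kernel_a.cpp", "negative_kernel_b.cpp"}  # hypothetical names

def compiles(src: pathlib.Path, include_dir: str) -> bool:
    result = subprocess.run(
        ["c++", "-std=c++17", "-fsyntax-only", f"-I{include_dir}", str(src)],
        capture_output=True,
    )
    return result.returncode == 0

for src in sorted(pathlib.Path("2.10").glob("*.cpp")):
    ok = compiles(src, include_dir="path/to/torch-2.9/include")  # placeholder path
    assert ok == (src.name in EXPECT_SUCCESS), f"unexpected result for {src.name}"
```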

Stack from ghstack (oldest at bottom):

cyyever and others added 30 commits November 16, 2025 07:19
This PR outputs chars to the stream without building temporary strings.
The files were modified with the following command (on fish):
```
sed  -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
and then some invalid changes were reverted.
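For reference, a rough Python equivalent of that substitution (a sketch, not the script that was actually used):

```python
# Replace single-character double-quoted stream literals with char literals,
# e.g. `<< "x"` becomes `<< 'x'`, while escapes such as `<< "\n"` are left alone.
import re

def char_literalize(source: str) -> str:
    # Mirrors the sed pattern: one char that is neither a backslash nor a quote.
    return re.sub(r'<< "([^\\\'])"', r"<< '\1'", source)

print(char_literalize('stream << "x" << "\\n";'))  # stream << 'x' << "\n";
```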

Pull Request resolved: pytorch#167899
Approved by: https://github.com/Skylion007
# Description
Fixes pytorch#114850; we will port the test utils and schema check to Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style:

# Changes
1. Get the device type from the accelerator via a `get_devtype` helper method (see the sketch after this list).
2. Replace the requires-CUDA statements with `device_type`.
3. Add `HAS_XPU` and `HAS_GPU` checks in place of some of the existing device checks.
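A hedged sketch of the pattern described in the list above; `get_devtype` here is an illustrative helper and may not match the exact helper used in the PR:

```python
# Assumption: torch.accelerator is available (PyTorch 2.6+); get_devtype is an
# illustrative helper, not necessarily the one used in the PR.
import torch

def get_devtype() -> str:
    acc = torch.accelerator.current_accelerator()
    return acc.type if acc is not None else "cpu"

device_type = get_devtype()          # "cuda", "xpu", or "cpu"
HAS_GPU = device_type in ("cuda", "xpu")

x = torch.randn(4, device=device_type if HAS_GPU else "cpu")
```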

# Notify

Pull Request resolved: pytorch#166684
Approved by: https://github.com/ezyang, https://github.com/guangyey

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Summary: This diff is a follow-up to D85883723.

Test Plan:
See D86719598. We are now able to publish the model.

Unit test:
```
buck run fbcode//mode/opt -c remoteexecution.local=enabled fbcode//sigmoid/inference/test:test_passes -m ovr_config//triton:experimental -- -r test_triton_hop_cpu
```

Differential Revision: D87091238

Pull Request resolved: pytorch#167862
Approved by: https://github.com/XueningXu
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: pytorch#167916
Approved by: https://github.com/Skylion007
**Summary:**
Optimize the scalar welford_reduce implementation by combining the Welford algorithm with cascade summation to improve numerical stability. Specifically:

1. Use the Welford algorithm to compute the mean and variance.
2. Use cascade summation when computing the sum over the input for both the mean and the variance (see the sketch after this list).
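Below is a small NumPy sketch of the idea, not the Inductor codegen itself: per-chunk Welford statistics are computed and then merged pairwise, which is what the cascade buys numerically:

```python
import numpy as np

def welford_chunk(chunk):
    # Classic Welford update over one chunk: returns (count, mean, M2).
    count, mean, m2 = 0, 0.0, 0.0
    for v in chunk:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return count, mean, m2

def welford_combine(a, b):
    # Chan et al. combination of two (count, mean, M2) triples.
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    return n, ma + delta * nb / n, m2a + m2b + delta * delta * na * nb / n

x = np.random.randn(1 << 14).astype(np.float32) + 1000.0  # large offset stresses precision
parts = [welford_chunk(c) for c in np.split(x, 16)]        # cascade: reduce chunks first
total = parts[0]
for p in parts[1:]:
    total = welford_combine(total, p)
n, mean, m2 = total
print(mean, m2 / n)   # mean and (biased) variance
```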

**Example:**
Take pytorch#141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    with torch._inductor.config.patch({"cpp.simdlen": 0}):
        c_model = torch.compile(model)
        c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**Logs:**

- Before
```
tensor(0.0005)
False
```
- After
```
tensor(1.4305e-06)
True
```

**Generated code:**
- Before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0);
                            }
                        }
                    }
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    WelfordHelper<float, float, 4096> scalar_welford_helper0(static_cast<int64_t>(524288L));
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0, &scalar_welford_helper0);
                            }
                        }
                    }
                    tmp_acc0_local = welford_combine(tmp_acc0_local, &scalar_welford_helper0);
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

Pull Request resolved: pytorch#162709
Approved by: https://github.com/CaoE, https://github.com/jansel
This PR fixes a bug where `torch.clamp` on MPS fails when min/max tensors have more dimensions than the input tensor.
CPU already supports this broadcasting, but MPS raised a RuntimeError.

Example of failing case before the fix:
```python
x = torch.randn(2, 3, device="mps")
min_t = torch.randn(1, 2, 3, device="mps")
max_t = torch.randn(1, 2, 3, device="mps")
torch.clamp(x, min=min_t, max=max_t)  # RuntimeError
```
After this fix, MPS matches CPU behavior.

Fixes pytorch#160734

Pull Request resolved: pytorch#165058
Approved by: https://github.com/malfet
PR pytorch#167401 reminded me that the removal of the old NVTX interface is long overdue: the header-only NVTX3 has been around for more than 5 years and ships with all CUDA Toolkit 12+ versions. In addition, `libnvToolsExt.so` was removed in CUDA Toolkit 13 and onward.

Pull Request resolved: pytorch#167637
Approved by: https://github.com/eqy
…device allocator (pytorch#166831)

The implementation of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] ->pytorch#166831
- [ ] pytorch#166833
- [ ] pytorch#166843

Pull Request resolved: pytorch#166831
Approved by: https://github.com/EikanWang, https://github.com/gujinghui

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
…lasLtWorkspace" (pytorch#167928)

Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected against concurrent reads and writes. This leads to crashes.

This diff adds mutexes to synchronize access to the static maps.
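The actual fix is C++ (a std::mutex guarding the static maps); as a rough Python analogue of the pattern:

```python
# Sketch only: a lazily populated per-device cache whose reads and writes are
# serialized by a lock, mirroring the synchronization added in this diff.
import threading

_lock = threading.Lock()
_handles: dict[int, object] = {}   # device index -> handle

def get_handle(device: int):
    with _lock:
        handle = _handles.get(device)
        if handle is None:
            handle = object()      # stand-in for creating a cuBLAS handle
            _handles[device] = handle
        return handle
```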

Re-land context:

This is a re-land of pytorch#167248.

A few issues were addressed:
- fix for a bug in the fast path (premature return in getCurrentCUDABlasHandle)
- fix for test flakiness (pytorch#167884)

Test Plan:
1. regression tests:
buck2 test mode/opt //caffe2/test:test_transformers_cuda
https://www.internalfb.com/intern/testinfra/testrun/6192449759713581

2. Use a GPU OD, run multi-threaded tests with TSAN:

buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

Differential Revision: D87111985

Pull Request resolved: pytorch#167928
Approved by: https://github.com/Skylion007
…rnels (pytorch#158250)

Co-authored-by: Nikhil Gupta [nikhil.gupta2@arm.com](mailto:nikhil.gupta2@arm.com)

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch to boost LLM prefill & decode performance

**This change improves decode throughput by ~15% and reduces the memory required for model inference by 50%.**

### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```
### Detailed Results

| Metric                           | With `--compile`         | Without `--compile`      |
|----------------------------------|---------------------------|---------------------------|
| Quantization Scheme              | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision                  | BF16                      | BF16                      |
| Number of Layers Quantized       | 32                        | 32                        |
| Average Compression Ratio        | 87.49%                    | 87.49%                    |
| Total Quantization Time (s)      | 9.62                      | 10.32                     |
| Compile Time (First) (s)         | 134.48                    | 1.69                      |
| Compile Time (Second) (s)        | 80.44                     | 1.60                      |
| Compile Time (Subsequent) (s)    | 0.19                      | 0.22                      |
| Prefill Tokens                   | 54                        | 54                        |
| Decoded Tokens                   | 33                        | 33                        |
| Prefill Time (s)                 | 0.19                      | 0.22                      |
| Decode Time (s)                  | 0.76                      | 1.38                      |
| E2E Generation Time (s)          | 0.95                      | 1.60                      |
| Prefill Throughput (tokens/s)    | 288.13                    | 249.91                    |
| Decode Throughput (tokens/s)     | 43.42                     | 23.83                     |
Pull Request resolved: pytorch#158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01

Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: pytorch#167968
Approved by: https://github.com/pytorchbot
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1e69f4](intel/torch-xpu-ops@1e69f40), which includes:

- Add PTL to the default AOT target list for both Windows and Linux
- Use PyTorch p2p API in Copy kernel
- Add event cache and event timing to XCCL
- Add Float8_e8m0fnu support for copy
- Add CMAKE_SYCL_COMPILER_LAUNCHER for sccache
Pull Request resolved: pytorch#167698
Approved by: https://github.com/EikanWang
Exposes `_inductor.config.bucket_all_reduces_fx`, analogous to the existing all_gathers and reduce_scatters options, with "all" as the only supported value.
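A minimal usage sketch, assuming the option is set the same way as the existing bucketing configs:

```python
import torch

# Globally:
torch._inductor.config.bucket_all_reduces_fx = "all"

# Or scoped to one compilation via a config patch:
with torch._inductor.config.patch({"bucket_all_reduces_fx": "all"}):
    compiled = torch.compile(lambda x: x + 1)
```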

Pull Request resolved: pytorch#167634
Approved by: https://github.com/eellison
Make the PyObject preservation scheme thread-safe with free-threaded (nogil) Python. The general idea is:

* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object

This is implemented in `intrusive_ptr`:

* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure it is kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than the one from the Python object), we decref the associated Python object to break the cycle. (A toy model of these transitions follows below.)
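A toy Python model of those transitions (illustration only; the real logic lives in `intrusive_ptr`'s refcount operations):

```python
class ToyTarget:
    """Models when the c10 object holds a strong ref to its PyObject."""

    def __init__(self):
        self.refcount = 0            # strong C++ refcount
        self.has_pyobject = False    # stands in for the kHasPyObject bit
        self.owns_pyobject_ref = False

    def incref(self):
        self.refcount += 1
        # 1 -> 2 with a PyObject attached: keep the Python wrapper alive.
        if self.refcount == 2 and self.has_pyobject:
            self.owns_pyobject_ref = True   # would Py_INCREF here

    def decref(self):
        # 2 -> 1: only the PyObject's own reference remains; break the cycle.
        if self.refcount == 2 and self.has_pyobject:
            self.owns_pyobject_ref = False  # would Py_DECREF here
        self.refcount -= 1
```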

Other benefits:

* We can delete a lot of the copypasta from Python's internal `subtype_dealloc`
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.

Risks:

* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>` even when we're not using Python.
* It's a big change

(Second attempt at pytorch#166342)

Pull Request resolved: pytorch#167564
Approved by: https://github.com/albanD, https://github.com/Skylion007
Previously we hard-failed if the pg was "gloo". Now we fall back on hardcoded formulas.

Pull Request resolved: pytorch#167827
Approved by: https://github.com/eellison
pytorch#166044 removed openblas from the wheel dependency list for the AArch64+CPU build, so this PR adds it back. This only affects the CPU build, since AArch64+CUDA uses NVPL.

Pull Request resolved: pytorch#167841
Approved by: https://github.com/tinglvv, https://github.com/malfet
Use standard HIP headers for unsafeAtomicAdd. This removes the copy/paste of unsafeAtomicAdd that served as a "preview" implementation for gfx942.

Pull Request resolved: pytorch#167661
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…rch#165067)"

This reverts commit 96a4c4b.

Reverted pytorch#165067 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87036515, @albanD please help the author get this PR merged ([comment](pytorch#165067 (comment)))
This reverts commit e20ca3b.

Reverted pytorch#167049 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87120562, @Skylion007 please help the author get this PR merged ([comment](pytorch#167049 (comment)))
This reverts commit 2245d7d.

Reverted pytorch#167899 on behalf of https://github.com/jeanschmidt due to need to revert in order to revert pytorch#167899 ([comment](pytorch#167899 (comment)))
This reverts commit deabb3e.

Reverted pytorch#167821 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D87148810. @Skylion007 may you help the author to get this PR merged? ([comment](pytorch#167821 (comment)))
Alas, one cannot use `repeat_interleave_common` for MPS tensors, as `data_offset` is not a valid pointer to `id<MTLTensor>`.
On the other hand, one does not need `AT_DISPATCH_INDEX_TYPES`, as dispatching happens on the shader side.

Fixes pytorch#167924
Pull Request resolved: pytorch#167961
Approved by: https://github.com/manuelcandales
Summary:

MXFP4 unit tests pass on B200 but fail on RTX 5090; disable the non-B200 cases.

Also fail with a not-implemented error on non-B200 devices to avoid unhelpful failure messages.
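A hedged sketch of the gating; the exact decorator and capability check used in the test suite may differ, and it assumes B200 reports compute capability 10.0:

```python
import pytest
import torch

def is_b200() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() == (10, 0)  # assumption: sm_100 == B200

@pytest.mark.skipif(not is_b200(), reason="MXFP4 currently only supported on B200")
def test_mxfp4_roundtrip():
    ...
```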

Test Plan:

```
pytest -sv -k "mxfp4" test/test_scaled_matmul_cuda.py
```

Reviewers:

@nWEIdia

Subscribers:

Tasks:

Fixes pytorch#167850

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#167857
Approved by: https://github.com/nWEIdia, https://github.com/malfet
Upgrade all the ROCm docker images to ROCm 7.1 release version.

Pull Request resolved: pytorch#166743
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Prachi Gupta <prachi.gupta@amd.com>
…7860)

getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified the calling code to stop binding a const reference to the result, for clarity.
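As a rough Python analogue of the change (the real code is C++):

```python
# Return a snapshot of the lock-protected state instead of a reference to it,
# so callers cannot observe concurrent mutation after the lock is released.
import threading

_lock = threading.Lock()
_operators: list[str] = []

def get_all_operators_for() -> list[str]:
    with _lock:
        return list(_operators)   # copy; safe to use outside the lock
```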

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

Pull Request resolved: pytorch#167860
Approved by: https://github.com/zou3519
…ytorch#161728)

Resolves pytorch#161290

## Summary

Expands the `dynamo/check_perf_csv.py` output with latency, compile-time, and memory information:

- Displays the measured speedup and the % deviation from the target
- Adds clear messaging listing all passing model tests when no regression is found
- Adds error handling for a missing CSV file

### Example (Failing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.40x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
Error 1 model(s) performance regressed
    hf_Bert
  - hf_Bert: 1.005x (< 1.40x; -28.2% from target)
```

### Example (Passing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.00
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.00x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
All 1 model(s) passed threshold check (>= 1.00x)
```

Pull Request resolved: pytorch#161728
Approved by: https://github.com/isuruf
pytorchmergebot and others added 18 commits November 17, 2025 17:59
This reverts commit 99fdca8.

Reverted pytorch#166492 on behalf of https://github.com/jeanschmidt due to Internally we still depend on the old logic, so we need to find a way to maintain backwards compatibility for now ([comment](pytorch#166492 (comment)))
…orch::stable::Tensor. (pytorch#161891)

This ghstack is a prerequisite for porting the torchaudio C++ extensions to the torch stable ABI, see pytorch/audio#4074, pytorch/audio#4075, pytorch/audio#4076, pytorch/audio#4077, pytorch/audio#4078.

Pull Request resolved: pytorch#161891
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: pytorch#167772
The following tests are failing on Python 3.14 on a Linux machine:

* TestSetAffinity::test_set_affinity_in_worker_init
    * Why? Python 3.14 makes `forkserver` the default start method for multiprocessing. With it, local functions are not picklable and the unit test fails (see the sketch after this list).
* TestIndividualWorkerQueue::test_ind_worker_queue
    * Why? The test was hitting a timeout. This is also related to the start method. I am increasing the timeout and reducing the batch-size iterations to reduce the total unit-test time.
    * Fixes pytorch#68643
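A minimal sketch of the failure mode referenced above (not the test itself): under `forkserver`/`spawn`, the init function must be pickled to reach the worker process, and locally defined functions are not picklable.

```python
import pickle

def make_local_init():
    def worker_init(worker_id):        # defined inside a function: not picklable
        print("init", worker_id)
    return worker_init

try:
    pickle.dumps(make_local_init())
except (pickle.PicklingError, AttributeError) as exc:
    print("cannot pickle local function:", exc)

def module_level_init(worker_id):      # module-level functions pickle fine
    print("init", worker_id)

pickle.dumps(module_level_init)
```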

Pull Request resolved: pytorch#167429
Approved by: https://github.com/aelavender, https://github.com/ramanishsingh
This reverts commit 77acc66.

Reverted pytorch#166743 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#166743 (comment)))
Not sure if the paths are already set up properly so that 'benchmarks/dynamo/huggingface.py' can be called directly from a unit test. Let's find out from CI.

Pull Request resolved: pytorch#167482
Approved by: https://github.com/v0i0, https://github.com/mlazos
Inductor may treat an outer reduction as an inner reduction when the reduction ranges contain a 1. This causes a weird issue where we skip fusing with mixed-order reduction. While I'm still debugging why that happens, I think we should fix the decision here anyway.

Pull Request resolved: pytorch#167697
Approved by: https://github.com/jansel, https://github.com/v0i0
Fixes pytorch#158429

Updated LogAddExpKernel.cu to allow for complex numbers. Also updated the unit test to run test_logaddexp on CUDA with complex data types, and added a unit test in test_linalg.py to compare results between CUDA and CPU.
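A small comparison sketch in the spirit of the added test, assuming complex inputs are accepted on both the CPU and CUDA paths as the description implies:

```python
import torch

a = torch.randn(8, dtype=torch.complex64)
b = torch.randn(8, dtype=torch.complex64)

cpu_result = torch.logaddexp(a, b)
if torch.cuda.is_available():
    cuda_result = torch.logaddexp(a.cuda(), b.cuda()).cpu()
    print(torch.allclose(cpu_result, cuda_result, atol=1e-5))
```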

@drisspg
Pull Request resolved: pytorch#163509
Approved by: https://github.com/isuruf
Enables the `out` variant of `mm` for sparse tensors.
Pull Request resolved: pytorch#167908
Approved by: https://github.com/malfet
…#167931)

Per title:
1) allows the `self` argument to have the same precision as the output
2) fixes broadcasting of the `self` argument; it used to allocate an incorrectly sized output and resize it later, causing a warning in addmm and an error in baddbmm (see the sketch after this list)
3) fixes `out` handling for the baddbmm `out` overload, where the implementation used uninitialized memory in `out` instead of copying `self` into `out`
4) removes a couple of unneeded IIFE patterns
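A small sketch of the broadcasting case from item (2), with illustrative shapes:

```python
import torch

bias = torch.randn(1, 4)                    # broadcastable `self`
a, b = torch.randn(3, 5), torch.randn(5, 4)

out = torch.addmm(bias, a, b)               # result shape (3, 4)
print(out.shape)
```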

Pull Request resolved: pytorch#167931
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/malfet
…idiAI kernels (pytorch#158250)"

This reverts commit 53809f9.

Reverted pytorch#158250 on behalf of https://github.com/zou3519 due to reverting to see if it fixes inductor halide test failure ([comment](pytorch#158250 (comment)))
Summary:
Add support for symint placeholders.

Added two test cases with dynamic reshape:
- dynamic info coming from tmd on placeholders
- dynamic info coming from placeholders (symints)

Test Plan:
test_reshape_dynamic_ph
test_reshape_dynamic_tmd

Differential Revision: D86984100

Pull Request resolved: pytorch#167757
Approved by: https://github.com/blaine-rister
…locate test into `TestSaveLoad` (pytorch#158247)

This is a follow-up to [pytorch#154333](pytorch#154333), where I initially introduced a fallback mechanism in deserialize_torch_artifact.

In this revised PR:

Cleaned up commit history for clarity and reproducibility.

Relocated the test into the TestSaveLoad class in test_serialize.py.

There were some issues with the last PR, so I opened this one.

The previous PR had inconsistencies due to local-branch issues and was closed in favor of this cleaner submission.

Feedback is very welcome.
Pull Request resolved: pytorch#158247
Approved by: https://github.com/angelayi
This reverts commit 99117c1.

Reverted pytorch#167637 on behalf of https://github.com/yangw-dev due to breaks internal build with torch/csrc/profiler/stubs/cuda.cpp:4:10: fatal error: 'nvtx3/nvtx3.hpp' file not found 4 | #include <nvtx3/nvtx3.hpp>, please find a meta fella to resolve this issue and try again, diff:[D87229660] ([comment](pytorch#167637 (comment)))
This reverts commit 7ede33b.

Reverted pytorch#167771 on behalf of https://github.com/eellison due to needs one fix ([comment](pytorch#167771 (comment)))
… used where needed"

Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that forcing this responsibility on the user might be less error-prone.

Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests, where compilation is expected to succeed.)




[ghstack-poisoned]
… used where needed"

Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that forcing this responsibility on the user might be less error-prone.

Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests, where compilation is expected to succeed.)




[ghstack-poisoned]
@bar-qodo

/agentic_review
